Introduction to R and RStudio

Introduction

General information

A programming language is a code for communication between a human and a machine (usually a computer). It allows you to give instructions to the computer. And the computer is very “wise”: it stupidly carries out each instruction that we give it.
There are a huge number of programming languages, and they keep evolving.

R is a programming language that allows you to:

  1. manipulate data: import, transform, export, etc.
  2. carry out more or less complex statistical analyses: description, exploration, modeling…
  3. create (pretty) figures
  4. and much more!

Features:

  • available on mainstream operating systems (Windows, macOS and Linux).
  • free and open source (GNU GPL license).
  • large user community / online help.
  • numerous additional packages for any purpose.

History:

  • 1993/08: first official release of R as a binary.
    • Written by Ross Ihaka and Robert Gentleman, simply aimed at being a programming language to teach introductory statistics at the University of Auckland.
    • Inspired by (and partially compatible with) the pre-existing S programming language (Bell Labs, TIBCO Software).
  • 1997/04: foundation of the Comprehensive R Archive Network (CRAN), the central repository for R packages.
  • 1997/12: release of the R v0.6.0 sources under the GNU licence.
  • 2000/02: release of R v1.0.0.
  • 2024/04: release of R v4.4.0.

Details about implementations of the R language

WARNING: R is not R! The R software you will use is NOT the R language itself! It is an implementation of the language, and other implementations coexist (or have coexisted) with it. For example: Revolution Analytics (by Microsoft, now defunct), Renjin (active, Java), FastR (active, Java), Riposte (defunct), CXXR (defunct, C++)…

R versus Excel:

Fun fact (more awkward than fun, actually) about why R is better than Excel:
In 2020, “Covid : UK loses thousands of cases due to … a saturated MS Excel file”

More information comparing R and Excel here.

The RGui

After installing R, you can launch it by double-clicking on the R icon.

[Screenshot: the RGui console]

An interface is software placed between you and the computer that allows you to communicate more easily with the computer.

As you can see, the default R interface (the RGui console; GUI stands for Graphical User Interface) is quite austere and bleak…

As such, it is warmly recommended to use additional software that acts as a graphical interface between you and R. This interface is a kind of shell that runs R in the background. Several graphical interfaces have been developed, but the most widely used and practical one is RStudio Desktop.

RStudio

First sight

[Screenshot: the RStudio window with its four panes]

RStudio displays 4 large panes by default. Their positions may vary with your version and preferences, but here is the default layout:

Pane names and positions
        LEFT           RIGHT
UPPER   Script pane    Environment/History pane
LOWER   Console pane   Help/Plots/Files pane

NOTE: your four panes may be blank, while most of mine are filled with text in the illustration. Do not panic! We’ll come back to that later.

The console pane (lower left)

[Screenshot: the console pane]

This is a simple interactive R console, like in the RGui (see previous section), that allows us to communicate with the computer.

WARNING: This example was produced on the bioinformatics platform BiGR. Your local RStudio may differ: the version of R, the list of available packages, etc. On your local machine, the RStudio console will match the RGui.

Here, I gave an instruction to R, and it told me with an “Error” message that it was unable to execute it. We communicate.

Let’s try to enter the print() command :

print("Hello World")
[1] "Hello World"

Here, we just used a function (which is an instruction), called print. This function attempts to display on screen anything provided between the parentheses ( and ). In our example, we provided the character string "Hello World" (thus, we used quotes), and the function print successfully printed it on screen!
The pieces of information put between the parentheses of a function are called arguments.

Now, click on Session -> Save Workspace As… and name it as you wish (I named mine “my_session.RData”). This will save your current workspace (the environment in which you are working: the custom objects and functions loaded into memory). What happened in the console pane? You nailed it! A command has been automatically written. I named the file “my_session.RData”, so for me, it is:

save.image("my_session.RData")

This is one of the many ways RStudio helps you in your work, by simplifying (mainly using clicks and buttons) some tedious, regular parts.

As a general recommendation, save your workspace after every important step in your code. If you need help, whether with a function error, a script result or anything similar, you can then share the saved file with your favorite R developer next door. This file contains every object that was loaded in your workspace when you saved it.

Note that you can create your own functions in the future.
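
As a minimal sketch of what such a custom function could look like (the name add_one and its argument are purely illustrative; they are not part of R or of this course):

```r
# Hypothetical example: a function that adds 1 to the number it receives
add_one <- function(number) {
  number + 1   # the last evaluated value is what the function returns
}

# Call it like any other function, with the argument between parentheses
result <- add_one(41)
print(result)
```

Once defined, add_one shows up in the ‘Environment’ tab, like any other object you create.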

A package is a collection of R functions that are not natively present in R.

NOTE: The RStudio console offers auto-completion and code suggestions: when you press the [TAB] key after writing the first few characters of a function or variable name, the console proposes possible names from a dictionary that matches your installation. Any reputable bioinformatician uses auto-completion, but shhh! It’s one of our many secrets!

The environment/history pane (upper right)

This pane has three main tabs: ‘Environment’, ‘History’ and ‘Connections’ (the ‘Tutorial’ tab was recently added, and a ‘Git’ tab may appear when one uses code in a version-controlled environment).

[Screenshot: the Environment/History pane]

Environment

‘Environment’ displays the list of every single variable, custom function, object or dataset loaded in R. This includes only what you defined yourself: it does not include environment variables, base functions or functions loaded from packages.

A variable allows you to store information in memory by giving it a name (like a box containing a piece of information, with a label on it). An object is a more complex variable that allows you to store several pieces of information under one name (see the R - Basics section).

Example : in your console pane, enter the following command:

my_var <- 0  # May also be written my_var = 0

What happened in the ‘Environment’ pane?

You nailed it: a new my_var variable is now available in your environment!

[Screenshot: my_var listed in the Environment tab]

When a more complex object is declared in your workspace, some general information may be displayed, too. For example:

df <- data.frame("a"=c(1, 3), "b"=c(2, 4))

You can observe this dataframe (something like a table, we’ll see it later).

Click on its name to get a preview of its contained data. Then, click on the light-blue downwards arrow to have a deeper insight of its content:

[Screenshot: the expanded view of df in the Environment tab]

Now click on Session -> Clear Workspace

… and watch how your objects/variables disappear!

This action cannot be undone! While it is useful to clear one’s workspace from time to time (in order to avoid name collisions, release a bit of used RAM, etc.), it is much better to save your workspace beforehand.

Once again, this “clickable” way is just an assist from RStudio: there are specific (written) commands to remove objects, with more manual control.
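
The written commands mentioned above can be sketched as follows, using the base functions ls(), rm() and exists() (the variable names are made up for this demonstration):

```r
# Illustrative variables, created only for this demonstration
temporary_value <- 1
another_value <- "text"

print(ls())                        # names of the objects in the workspace
rm(temporary_value)                # remove one object, by name
print(exists("temporary_value"))   # FALSE: the object is gone
print(exists("another_value"))     # TRUE: the other one is untouched

# rm(list = ls()) would remove every object at once,
# like Session -> Clear Workspace (left commented out on purpose)
```

Removing objects one by one with rm() gives finer control than clearing the whole workspace.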

History

[Screenshots: the History tab]

This tab is indispensable when coding and testing: while you test and search in the console, the history keeps track of each command line you typed. This will definitely help you to build your scripts, pass your command lines to coworkers, try and track code variations, and revert unfortunate errors.

Each history is tied to a session. You may see many commands in your history, even some you never entered manually: when an RStudio “assist” menu or button is used (e.g. knitting commands, display commands, help commands, etc…), in most cases it automatically runs an R command that is stored in the history along with the ones you entered manually.

Please note that your history has a limit (512 entries by default) and only keeps the latest command lines. The default value can be changed (but this is best left to more advanced users).

The help/plots/packages/files pane (lower right)

This pane has four main tabs: ‘Files’, ‘Plots’, ‘Packages’ and ‘Help’ (the ‘Viewer’ tab is rarely used, and ‘Presentation’ was added very recently).

[Screenshot: the Help/Plots/Packages/Files pane]

Help

This is maybe the most important/useful pane of your RStudio, from a user’s point of view. THIS is the difference between RStudio and another code editor. Search for any function here, locally and not on the internet. This pane shows you the help available for YOUR version of R and YOUR version of a given package.

Indeed, a published function documented in one version of R or of one of its packages can still evolve over time, as its author sees fit. Parameters can be added, removed or renamed, and default values changed. Consequently, the help from a remote source (the internet) may not match your installed version!
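
To know which versions your local help pages describe, you can ask R directly. A small sketch using only functions shipped with R (the package "stats" is chosen here simply because it is always installed):

```r
# Version of R currently running
print(R.version.string)

# Version of an installed package
print(packageVersion("stats"))

# Open the local help page of a function
help("print")
```

The short form ?print opens the same page as help("print").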

When copy-pasting commands from a remote source, please make sure the source is clean (i.e. contains nothing harmful for your system) and compatible with your installed version (this is a major source of errors).

Never ever copy code from the internet straight into your console. Why? Example: https://www.wizer-training.com/blog/copy-paste

Files

Just like any system file explorer, it lets us move across directories, create folders and files, delete them, etc… from within RStudio.

Initially, this file explorer is set to your current working directory. This means that when you refer to directories or files without giving an absolute path, R looks for them relative to that directory.
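
The same information is available from the console. A short sketch with base functions (the relative path "Intro_R" is only an example and does not need to exist):

```r
# Where am I? Absolute path of the current working directory
current_directory <- getwd()
print(current_directory)

# What is here? Files and directories found at that location
print(list.files())

# A relative path such as "Intro_R" is interpreted from the working
# directory, i.e. it designates the following absolute path:
print(file.path(current_directory, "Intro_R"))
```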

NOTE : this is not a theoretical representation : performing modifications here will modify the file structure on your drive !

Create a directory

Here, we will create a new directory, using two methods:

  1. With the GUI

[Screenshot: creating a directory from the Files tab]

  2. With a command line:

Use the dir.create() function :

dir.create("Intro_R")

Change working directory

One can change one’s working directory:

  1. With the GUI:

[Screenshot: changing the working directory with the GUI]

  2. With a command:

Use setwd():

setwd("Intro_R")

Delete files

You can delete files:

  1. With the GUI

[Screenshot: deleting files with the GUI]

  2. With a command

Use the file.remove() function:

file.remove("annotation.csv")
file.remove("expression.txt")

NOTE: There is no such thing as a “Recycle bin” here! Deleted files or directories are gone for good!
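
Since deletion is permanent, it can be worth checking what you are about to remove. A cautious sketch that works inside R’s temporary directory, so that no real data is touched (the file name is hypothetical):

```r
# Build a path inside the session's temporary directory
scratch_file <- file.path(tempdir(), "scratch_example.txt")

file.create(scratch_file)            # create an empty file
print(file.exists(scratch_file))     # TRUE: the file is there

if (file.exists(scratch_file)) {     # check before deleting
  file.remove(scratch_file)
}
print(file.exists(scratch_file))     # FALSE: deleted, with no way back
```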

Packages

Here are listed all locally installed packages (not available packages), with a description and a version number.

[Screenshot: the Packages tab]

You will get more information about packages in an upcoming subsection.

Plots

If you work with R scripts that produce plots, the generated graphs will be displayed here!

These can sometimes be interactive plots (ability to zoom, scroll, etc…), depending on the functions used.

The script pane (upper left)

[Screenshot: the script pane]

This is where you will spend most of your time when writing your R scripts (along with the ‘Help’ pane!).
A script is a text file in which you save your command lines, in the order in which they should be run.
This pane also accepts other languages (e.g. bash, python, …) or different R flavors (R Markdown, for example), but RStudio obviously shines for its R integration.

Please, please! Write your commands in the ‘Script’ pane, then execute them (hitting [CTRL] + [Enter] with the cursor placed on the command to run), rather than writing them directly in the console! This has only advantages: you can track, save and share your work, and test variations, without relying on your (spoiler: imperfect) memory.

The default file extension for a saved R script is .R or .r (.Rmd or .RMD for R Markdown documents), for example my_script.R.

[Screenshot: the script pane with an unsaved script]

NOTE: The file name here appears in red and ends with a star ‘*’ because some content was written in the script but not yet saved to disk.

TLDR – Too Long Didn’t Read

Graphic interface presentation:

  1. Write command lines in the ‘Script’ pane (upper left)
  2. Execute command lines by hitting [CTRL] + [Enter] from the script pane, then observe their output in the ‘Console’ / ‘Plots’ panes.
  3. Take a look at your environment and history in the upper right pane
  4. Search for help in the lower right pane.

R – Basics

Variables and types

Remember : a variable is a name given to any value of any type stored in memory.

Here, we will describe basic types.
A type describes the kind of data a value holds. It can be an integer, a decimal number, a string, an array, etc.

Number

For example, 3, the number three, exists in R and is understood as such by R. You can store this value in a variable using the arrow assignment operator <-:

three <- 3

In the above code, the number 3 is stored in a variable called “three” (it could be any other name, as long as it avoids special characters).

You can do this in R with almost anything. Literally anything. Whole files, pipelines, images, anything.

# The example below is a very good example of
# how to never ever name a variable
# (the name used here is purely illustrative):
`:-)` <- "happy"

Maths in R works the same as your regular calculator :

3 + three # Add
[1] 6
1 - 2 # Subtract
[1] -1
4 / 2 # Divide
[1] 2
3 * 4 # Multiply
[1] 12
7 %/% 2 # Floor division
[1] 3

Info: # is the way to write a comment in your script. Anything written after #, up to the end of the line, is ignored by R. For example, in 3 + three # Add, 3 + three will be executed, but # Add will not. A good coder uses comments, plenty of them, and relevant ones, to explain what the code does!

NOTE: Here, we simplified how R actually handles numbers. There are two main subtypes of numbers: 1) integers 2) numerics (called “floats” in most other languages).
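
The two subtypes can be told apart as sketched below. The L suffix is standard R syntax for writing an integer; the variable names are illustrative:

```r
whole_number <- 3L     # the L suffix asks for an integer
decimal_number <- 3    # without it, R stores a numeric (double)

print(class(whole_number))      # "integer"
print(class(decimal_number))    # "numeric"
print(is.numeric(whole_number)) # TRUE: integers are numbers too

print(whole_number == decimal_number)          # TRUE: same value
print(identical(whole_number, decimal_number)) # FALSE: different types
```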

Character

Characters correspond to letters and strings of letters, and are delimited with quotes, either single ' or double " (as long as the same type is used for both ends: one cannot start a character string with a single quote and end it with a double one!):

four <- "4"
five <- '5'

Mathematics does not work with characters at all… Try the following:

"4" + 1
four + 1

Computer answer

Error in "4" + 1 : non-numeric argument to binary operator

You can try to turn characters into numbers with the as.numeric() function:

as.numeric("4") + 1
[1] 5
as.numeric(four) + 1
[1] 5
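
The conversion only succeeds when the text actually looks like a number. A small sketch of what happens otherwise (the input values are made up; suppressWarnings() merely hides the warning R emits for the failed conversion):

```r
# "four" cannot be read as a number: R returns NA (a missing value)
converted <- suppressWarnings(as.numeric(c("4", "4.5", "four")))
print(converted)

# is.na() tells which conversions failed
print(is.na(converted))   # FALSE FALSE TRUE
```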

Function

A function is a piece of R code that can perform more complex work. It is called using a command name followed by parentheses ( and ). Between these parentheses, we enter the arguments expected by the function.

Use the help pane to get information about the names of the arguments expected and/or understood by a given function, and the expected type of their values.

As previously described, you can store the result of any command in a variable:

five <- as.numeric("4") + 1
print(five)
[1] 5

Please! Please! Give your variables explanatory names, so that their content is understandable by humans. I would be pissed off seeing any of you calling a variable “a”, “b”, “xyz”, “my_awsome_var”, “dunno_what_it_is”, “thing1”, “thing2”, etc…

Question :

I have two numbers stored in these variables :

  • mysterious_number_7
  • suspicious_number_7

When I apply the print() function on them, it returns in both cases 7.

They are both numeric.

However, they are not equal …

Why ?

Code to check question facts
# Show the value of the variable mysterious_number_7
print(mysterious_number_7)
[1] 7
# Show the value of the number suspicious_number_7
print(suspicious_number_7)
[1] 7
# Check that mysterious_number_7 is a number
is.numeric(mysterious_number_7)
[1] TRUE
# Check that suspicious_number_7 is a number
is.numeric(suspicious_number_7)
[1] TRUE
# Check that values of mysterious_number_7 and suspicious_number_7 are equal
mysterious_number_7 == suspicious_number_7
[1] FALSE
# Check that values of mysterious_number_7 and suspicious_number_7 are identical
identical(mysterious_number_7, suspicious_number_7)
[1] FALSE

We will talk about differences between equality and identity later.

Answer

This is due to the number of digits displayed by R. You are very likely to run into this issue in the future, like every (bio)informatician around the world.

mysterious_number_7 <- 7.0000001
suspicious_number_7 <- 7

You can change the number of displayed digits with the function options():

options(digits=8) # allows you to print up to 8 significant digits
print(mysterious_number_7)
[1] 7.0000001
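
The same trap appears with ordinary decimal arithmetic. A short sketch using base functions only: all.equal() compares numbers within a small tolerance, and format() displays extra digits for one value without changing the global options():

```r
# Both sides print as 0.3, yet they are not strictly equal
print(0.1 + 0.2 == 0.3)                    # FALSE

# all.equal() tolerates tiny differences; wrap it in isTRUE()
print(isTRUE(all.equal(0.1 + 0.2, 0.3)))   # TRUE

# Display more digits for a single value
print(format(7.0000001, digits = 8))       # "7.0000001"
```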


Logical

Aside from characters and numbers, there is another very important type in R (and in computer science in general): logicals, which correspond to the boolean type. There are two logical values: TRUE and FALSE.

3 > 4
[1] FALSE
5 < 10
[1] TRUE
5 == 10
[1] FALSE

Data structures

Until now, we have stored a single piece of information in each variable. But we can create more complex structures in order to store several pieces of information in a single variable.

[Figure: overview of R data structures]

NOTE: In your usage of R, you may come across R objects. These are a special kind of complex structure that can store multiple data of different types, and that also follow rules for how users interact with them (for example, getters and setters, to get information from them or set information in them).

Vector

You can make vectors and tables in R. Don’t panic, there will be no maths in this course: a vector is simply a one-dimensional series of values of the same type.

In R, vectors can be created with the c() function (“c” for “concatenate”) :

one2twenty <- c("1", "2", "3", "4", "10", "20")
print(one2twenty)
[1] "1"  "2"  "3"  "4"  "10" "20"

One can check if a variable is a vector :

is.vector(one2twenty)
[1] TRUE

As a uni-dimensional object, one can get its length :

length(one2twenty)
[1] 6

One can select an element of the vector using square brackets [ and ] holding the index of the desired element:

one2twenty[1] # select the first element
[1] "1"

NOTE : R index starts at the value 1 (many other languages start at 0)

one2twenty[5] # select the fifth element
[1] "10"

One can select multiple elements of a vector with a ranged index:

one2twenty[2:4] # select from second to fourth element
[1] "2" "3" "4"

One can combine both ways by wrapping the indices in c():

one2twenty[c(1, 3:5)] # select the first and third to fifth elements
[1] "1"  "3"  "4"  "10"

Question 1: Is there a difference between these two vectors ?

c_vector <- c("1", "2", "3")
n_vector <- c( 1,   2,   3 )

Answer

There is a difference indeed: c_vector contains characters, n_vector contains numerics.

print(c_vector)
[1] "1" "2" "3"
print(n_vector)
[1] 1 2 3
print(is.numeric(c_vector))
[1] FALSE
print(is.numeric(n_vector))
[1] TRUE
identical(c_vector, n_vector)
[1] FALSE

You can always use the identical() function to test identity strictly and exactly.

You may have learned about the operator == for equality. But it is not perfect; look at our example:

c_vector == n_vector
[1] TRUE TRUE TRUE

The operator == does not check data types: it first converts both sides to a common type (here, the numbers become characters), then compares the values.

Another example, mixing numeric and booleans:

1 == TRUE
[1] TRUE
identical(1, TRUE)
[1] FALSE

In computer science, there is a historical reason why booleans and integers are interchangeable.

More information about that

This is linked to the history of programming languages. Before actual boolean types (TRUE and FALSE) were introduced, the integers 1 and 0 were the usual representation of true and false values, as in C89 (the first standardized version of the C programming language, published in 1989).
To avoid breaking imperfect but working code, the new boolean type had to behave just like 0s and 1s, not only for truth values but in all integer operations. No one would recommend using a boolean result in a numeric context, nor would most people recommend testing equality to determine a truth value, but no one wanted to find out the hard way how much existing code relied on this. Hence the decision to make TRUE and FALSE behave like 1 and 0, respectively. This is merely a historical artifact of how languages evolved.
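
In practice, this equivalence is what makes counting with logicals possible. A brief sketch (the values are made up):

```r
print(as.numeric(TRUE))    # 1
print(as.numeric(FALSE))   # 0
print(TRUE + TRUE)         # 2

values <- c(3, 8, 12, 5)   # illustrative values
print(values > 4)          # FALSE TRUE TRUE TRUE
print(sum(values > 4))     # 3: how many values are above 4
print(mean(values > 4))    # 0.75: proportion of values above 4
```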


Question 2: Can I include both text and numbers in a vector ?

mixed_vector <- c(1, "2", 3)

Answer

YES and NO. The example shows that we technically can, but we end up with an unexpected result: all our values have been turned into characters, because types cannot be mixed in a vector. Either all its content is made of numbers, or all its content is made of characters.

print(mixed_vector)
[1] "1" "2" "3"
print(is.numeric(mixed_vector))
[1] FALSE
print(is.character(mixed_vector))
[1] TRUE
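
More generally, when types are mixed inside c(), R silently converts everything to the most flexible type present. A sketch of this behaviour with the types seen so far:

```r
print(class(c(TRUE, 2L)))       # "integer": TRUE becomes 1L
print(class(c(2L, 3.5)))        # "numeric": 2L becomes 2
print(class(c(3.5, "text")))    # "character": 3.5 becomes "3.5"
print(class(c(TRUE, "text")))   # "character": TRUE becomes "TRUE"
```

No error or warning is raised, which is why checking a vector with class() or str() after building it is a useful habit.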


Question 3: How to create a histogram from a vector?

Help

A simple way to visualize your data is to use a graph. The hist() function may help you (of course, use the Help pane!).

Answer

hist(c_vector)

Error in hist.default(c_vector) : ‘x’ must be numeric

Errr… Why is this command not working?

The error says: “‘x’ must be numeric”. The function only accepts a vector made of numeric values.

hist(n_vector) # works perfectly!


Data Frame

In R, a table is stored as a data.frame. One way to create one from scratch is to use the data.frame() function:

one2three4 <- data.frame(col1 = c(1, 3), col2 = c(2, 4))
print(one2three4)
  col1 col2
1    1    2
2    3    4

By default, for dataframes, R accepts names for columns and rows. You can rename columns and row names with the colnames() and rownames() functions, respectively.

colnames(one2three4) <- c("Col_1_3", "Col_2_4")
rownames(one2three4) <- c("Row_1_2", "Row_3_4")
print(one2three4)
        Col_1_3 Col_2_4
Row_1_2       1       2
Row_3_4       3       4

You can access a row and a column of the data frame using an index for each dimension inside square brackets [ and ].

R expects first an index for rows, then columns : data_frame[row_index, column_index]. You can either use the name of row(s)/column(s) or their position index.

If one wants all values for one of the dimensions, one should just leave the bracket part empty.

# Select a row by its name
print(one2three4["Row_1_2", ])
        Col_1_3 Col_2_4
Row_1_2       1       2
# Select a row by its index
print(one2three4[1, ])
        Col_1_3 Col_2_4
Row_1_2       1       2
# Select a column by its name
print(one2three4[, "Col_1_3"])
[1] 1 3
# Select a column by its index
print(one2three4[, 1])
[1] 1 3
# Select a cell in the table
print(one2three4["Row_1_2", "Col_1_3"])
[1] 1
# Select the first two rows and the first column in the table
print(one2three4[1:2, 1])
[1] 1 3

If you like maths, you will recall the [row, column] order. If you’re not familiar with it, you will probably do like 99% of all software engineers: write indices in the wrong [column, row] order in your first interactions with R, and raise an error as a result. Trust me. 99%, easy. And just remember that an error is never a dead end in computing.
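
Here is a sketch of that classic mistake. It rebuilds a small table like the one above under a new, illustrative name (demo_table) and uses tryCatch() to capture the error message instead of stopping the script:

```r
demo_table <- data.frame(Col_1_3 = c(1, 3), Col_2_4 = c(2, 4),
                         row.names = c("Row_1_2", "Row_3_4"))

# Correct order: [row, column]
right_order <- demo_table["Row_3_4", "Col_2_4"]
print(right_order)   # 4

# Wrong order: [column, row]; "Row_3_4" is not a column name
wrong_order <- tryCatch(
  demo_table["Col_2_4", "Row_3_4"],
  error = function(e) conditionMessage(e)   # keep only the message
)
print(wrong_order)
```

Reading the captured message is usually enough to spot that rows and columns were swapped.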

Question 1 : Can I mix characters and numbers in a data frame row ?

Answer

Indeed, it is possible !

mixed_data_frame <- data.frame(
  "Character_Column" = c("a", "b", "c"),
  "Number_Column" = c(4, 5, 6)
)
print(mixed_data_frame)
  Character_Column Number_Column
1                a             4
2                b             5
3                c             6

The str() function can be used to look at the type of each element in an object.

str(mixed_data_frame)
'data.frame':   3 obs. of  2 variables:
 $ Character_Column: chr  "a" "b" "c"
 $ Number_Column   : num  4 5 6
str(one2three4)
'data.frame':   2 obs. of  2 variables:
 $ Col_1_3: num  1 3
 $ Col_2_4: num  2 4


Question 2 : Can I mix characters and numbers in a data frame column ?

Answer

Unfortunately, no, you can’t:

mixed_data_frame <- data.frame(
  "Mixed_letters" = c(1, "b", "c"),
  "Mixed_numbers" = c(4, "5", 6)
)
print(mixed_data_frame)
  Mixed_letters Mixed_numbers
1             1             4
2             b             5
3             c             6
str(mixed_data_frame)
'data.frame':   3 obs. of  2 variables:
 $ Mixed_letters: chr  "1" "b" "c"
 $ Mixed_numbers: chr  "4" "5" "6"

All this is because the data.frame is a special kind of two-dimensional object in R that is structured by column: each column holds a single type, but different columns can hold different types.


Read a table as data frame

Exercise: Use the Help pane to find how to use the read.csv() function.

You can find the example_table.csv input file here. Download it by clicking on the [Download raw file] button.

Use the read.csv() function to open the table, knowing that:

  1. the file is named example_table.csv.
  2. this table has a header (TRUE). A header is a title line that defines column names.
  3. this table has row names in the column called “Gene_id”.
  4. as a CSV file, its field separator is a comma ,.

Leave all other parameters at their default values.

Save the opened table in a variable called example_table.

Answer

example_table <- read.csv(
  file = "example_table.csv",
  header = TRUE,
  row.names = "Gene_id",
  sep = ","
)
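
If you do not have the file at hand, the same call can be tried on a tiny made-up table. In this sketch, the gene names, values and the temporary file are all invented for the demonstration:

```r
# Write a small CSV file to a temporary location
demo_file <- tempfile(fileext = ".csv")
writeLines(c("Gene_id,Sample1,Sample2",
             "GeneA,9.5,10.1",
             "GeneB,12.3,11.8"),
           demo_file)

# Read it back: the "Gene_id" column becomes the row names
demo_table <- read.csv(file = demo_file, header = TRUE,
                       row.names = "Gene_id", sep = ",")
print(demo_table)
print(rownames(demo_table))   # "GeneA" "GeneB"
print(colnames(demo_table))   # "Sample1" "Sample2"

file.remove(demo_file)        # clean up the temporary file
```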

Now let us explore this dataset.

We can click on the ‘Environment’ pane:

[Screenshot: example_table listed in the Environment tab]

And if you click on it:

[Screenshot: example_table opened in the data viewer]

Be careful! Visualizing a large table may hang your session (by consuming too much RAM).

Alternatively, we can use the head() function, which prints the first lines of a table:

head(example_table)
        Sample1   Sample2   Sample3   Sample4
Caml   9.998194 10.004116  9.172489  9.139667
Scamp5 9.995917 10.818685 11.417558 14.907892
Dgki   9.993974 13.664396 16.132275 17.420057
Mas1   9.993956 11.370854 11.233629  9.912863
Apba1  9.992540 14.253438 14.001228 13.654701
Phkg2  9.980898  8.748654  8.714821  9.146529

The summary() function describes the dataset per sample (per column) :

summary(example_table)
    Sample1           Sample2            Sample3            Sample4        
 Min.   : 9.9437   Min.   :  6.8385   Min.   :  5.5512   Min.   :  5.8437  
 1st Qu.: 9.9526   1st Qu.:  9.0000   1st Qu.: 10.1196   1st Qu.:  9.7785  
 Median : 9.9710   Median : 10.9544   Median : 11.3256   Median : 11.9052  
 Mean   :18.9372   Mean   : 19.8355   Mean   : 20.8277   Mean   : 21.4123  
 3rd Qu.: 9.9940   3rd Qu.: 12.6467   3rd Qu.: 12.6499   3rd Qu.: 13.9677  
 Max.   :99.7837   Max.   :105.0774   Max.   :112.1882   Max.   :111.8205  

Have a look at the summary() of the dataset per gene, using the t() function to transpose:

head(t(example_table))
             Caml    Scamp5      Dgki      Mas1     Apba1    Phkg2    Timm8b
Sample1  9.998194  9.995917  9.993974  9.993956  9.992540 9.980898  99.78373
Sample2 10.004116 10.818685 13.664396 11.370854 14.253438 8.748654 105.07739
Sample3  9.172489 11.417558 16.132275 11.233629 14.001228 8.714821 112.18819
Sample4  9.139667 14.907892 17.420057  9.912863 13.654701 9.146529 109.09544
            Capn7     Yrdc    Coq10a   Gm27000    Lrrc41    Acadsb    Pdzd11
Sample1  9.976005 9.971093  9.970835  9.965511  9.960667  9.959179  9.952750
Sample2 11.314599 8.905508  8.820582  7.414795  9.961954 11.261520  9.031553
Sample3 11.452421 7.367243 10.449131  7.709008 10.435298 12.336088 10.700876
Sample4 11.692871 9.375526 10.865062 13.126211  9.137375 12.703318 10.832218
          Smarca2    Gm26079     Ptpn5     Rexo2     Ifi27   Snhg20
Sample1  9.952224  99.514659  9.947524  9.946340  9.943989 9.943724
Sample2  9.272424 103.089626 11.090058 13.363912 12.407626 6.838499
Sample3 11.194709 109.856535 11.572261 11.477445 13.591186 5.551247
Sample4 12.117571 111.820504 10.255021 12.292877 14.906542 5.843670
summary(t(example_table))
      Caml             Scamp5             Dgki             Mas1        
 Min.   : 9.1397   Min.   : 9.9959   Min.   : 9.994   Min.   : 9.9129  
 1st Qu.: 9.1643   1st Qu.:10.6130   1st Qu.:12.747   1st Qu.: 9.9737  
 Median : 9.5853   Median :11.1181   Median :14.898   Median :10.6138  
 Mean   : 9.5786   Mean   :11.7850   Mean   :14.303   Mean   :10.6278  
 3rd Qu.: 9.9997   3rd Qu.:12.2901   3rd Qu.:16.454   3rd Qu.:11.2679  
 Max.   :10.0041   Max.   :14.9079   Max.   :17.420   Max.   :11.3709  
     Apba1             Phkg2            Timm8b            Capn7       
 Min.   : 9.9925   Min.   :8.7148   Min.   : 99.784   Min.   : 9.976  
 1st Qu.:12.7392   1st Qu.:8.7402   1st Qu.:103.754   1st Qu.:10.980  
 Median :13.8280   Median :8.9476   Median :107.086   Median :11.384  
 Mean   :12.9755   Mean   :9.1477   Mean   :106.536   Mean   :11.109  
 3rd Qu.:14.0643   3rd Qu.:9.3551   3rd Qu.:109.869   3rd Qu.:11.513  
 Max.   :14.2534   Max.   :9.9809   Max.   :112.188   Max.   :11.693  
      Yrdc            Coq10a           Gm27000            Lrrc41       
 Min.   :7.3672   Min.   : 8.8206   Min.   : 7.4148   Min.   : 9.1374  
 1st Qu.:8.5209   1st Qu.: 9.6833   1st Qu.: 7.6355   1st Qu.: 9.7548  
 Median :9.1405   Median :10.2100   Median : 8.8373   Median : 9.9613  
 Mean   :8.9048   Mean   :10.0264   Mean   : 9.5539   Mean   : 9.8738  
 3rd Qu.:9.5244   3rd Qu.:10.5531   3rd Qu.:10.7557   3rd Qu.:10.0803  
 Max.   :9.9711   Max.   :10.8651   Max.   :13.1262   Max.   :10.4353  
     Acadsb            Pdzd11           Smarca2           Gm26079       
 Min.   : 9.9592   Min.   : 9.0316   Min.   : 9.2724   Min.   : 99.515  
 1st Qu.:10.9359   1st Qu.: 9.7225   1st Qu.: 9.7823   1st Qu.:102.196  
 Median :11.7988   Median :10.3268   Median :10.5735   Median :106.473  
 Mean   :11.5650   Mean   :10.1293   Mean   :10.6342   Mean   :106.070  
 3rd Qu.:12.4279   3rd Qu.:10.7337   3rd Qu.:11.4254   3rd Qu.:110.348  
 Max.   :12.7033   Max.   :10.8322   Max.   :12.1176   Max.   :111.821  
     Ptpn5             Rexo2             Ifi27            Snhg20      
 Min.   : 9.9475   Min.   : 9.9463   Min.   : 9.944   Min.   :5.5512  
 1st Qu.:10.1781   1st Qu.:11.0947   1st Qu.:11.792   1st Qu.:5.7706  
 Median :10.6725   Median :11.8852   Median :12.999   Median :6.3411  
 Mean   :10.7162   Mean   :11.7701   Mean   :12.712   Mean   :7.0443  
 3rd Qu.:11.2106   3rd Qu.:12.5606   3rd Qu.:13.920   3rd Qu.:7.6148  
 Max.   :11.5723   Max.   :13.3639   Max.   :14.907   Max.   :9.9437  
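
One detail worth knowing: t() does not return a data.frame but a matrix. The sketch below shows this on a tiny made-up table (names and values are illustrative), along with rowMeans() as a quicker way to get one statistic per gene:

```r
demo_table <- data.frame(Sample1 = c(1, 2), Sample2 = c(3, 4),
                         Sample3 = c(5, 6),
                         row.names = c("GeneA", "GeneB"))

transposed <- t(demo_table)
print(class(demo_table))       # "data.frame"
print(class(transposed)[1])    # "matrix"
print(dim(demo_table))         # 2 3 (genes x samples)
print(dim(transposed))         # 3 2 (samples x genes)

# Mean per gene (per row), without transposing
print(rowMeans(demo_table))    # GeneA: 3, GeneB: 4
```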

To go further

# number of columns
ncol(example_table)
[1] 4
# number of rows
nrow(example_table)
[1] 20
# get dimensions (number of rows and number of columns)
dim(example_table)
[1] 20  4
# type of each element
str(example_table)
'data.frame':   20 obs. of  4 variables:
 $ Sample1: num  10 10 9.99 9.99 9.99 ...
 $ Sample2: num  10 10.8 13.7 11.4 14.3 ...
 $ Sample3: num  9.17 11.42 16.13 11.23 14 ...
 $ Sample4: num  9.14 14.91 17.42 9.91 13.65 ...

TLDR – Too Long Didn’t Read !

# Declare a variable, and store a value inside :
three <- 3

# Basic maths operators: + - / * work as intended:
six <- 3 + 3

# Quotes are used to delimit characters:
seven <- "7"

# You cannot perform maths on characters :
"7" + 8 # raises an error
seven + 8 # also raises an error
six + 8 # works fine

# R does its best to help you. You can change the type of a value with:
as.numeric("4") # the character "4" becomes the number 4
as.character(10) # the number 10 becomes the character "10"

# You can compare values; the result is a logical:
six < seven
six + 1 >= seven
identical(example_table, mixed_data_frame)

# You can load a data.frame from a text file and save it to one:
read.table(file = ..., sep = ..., header = TRUE)
write.table(x = ..., file = ...)

# Create a table with:
my_table <- data.frame(...)

# Create a vector with:
my_vector <- c(...)

# You can read the first lines of a data.frame :
head(example_table)

# Search for help in the 'Help' pane or with:
help(...)

R – Packages

What are modules and packages

Modules and package are considered to be the same thing in this lesson. The difference is technical and does not relate to our scope.

Most of the work you are likely to do with R will require one or multiple additional packages. A package is a list of functions, pipelines, or datasets shipped under a given name. In general, a package groups together functions linked to an analysis scheme or theme towards a defined aim. Every single function you use through R comes from a package or another. Those used till now in this lesson come mostly from the ‘base’ package (R is shipped with a short list of mandatory packages)

Invoke the help page for the print function, and read the very first line of the ‘Help’ pane :

help(print)

It reads: print {base} : The function print comes from the package base.

# Call the function "print", with the argument "You're the best!"
print("You're the best!")
[1] "You're the best!"
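
The same information can be obtained from the console. This is only a sketch: package_of() is a hypothetical helper written for this lesson, not a function shipped with R, and it only works for regular functions (a few very low-level ones, such as sum, have no environment to inspect).

```r
# Hypothetical helper: returns the name of the package a function comes from
package_of <- function(f) {
  environmentName(environment(f))
}

package_of(print)     # "base"
package_of(head)      # "utils"
package_of(rowMeans)  # "base"
```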

WARNING : Sometimes, two packages use the same name for different functions ! They most certainly do not do exactly the same thing. IMHO, it is a good habit to ALWAYS disambiguate the package when calling a function, using this syntax : package::function. Writing base::print() is better and clearer than using print() alone.

# Call the function "print" ***from the package base***, with the argument "You're the best!"
base::print("You're the best!")
[1] "You're the best!"
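
To see why the package::function syntax matters, here is a small sketch that creates a name clash on purpose. The fake mean() function below is made up for the demonstration; in real life the clash usually comes from loading two packages that both define a function with the same name.

```r
# Deliberately define a function that reuses the name of base::mean
mean <- function(x) "not the real mean"

# The bare name now points to our function...
masked_result <- mean(c(1, 2, 3))
print(masked_result)

# ...but the disambiguated call still reaches the function from package base
real_result <- base::mean(c(1, 2, 3))
print(real_result)

# Clean up: remove our function so that mean() behaves normally again
rm(mean)
```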

Install a package

Your work will probably require the installation of a new package from an external source.

R can install packages from multiple different sources and repositories, but by default, it installs from the CRAN repository.

Use install.packages() to install a package.

# Install a package with the following function
install.packages("dplyr")

This will raise a prompt asking for simple questions : from which mirror site to download (choose somewhere in France), whether to update other packages or not, etc.

Do not be afraid of the large amount of text printed in the console, and let R do the trick. These are messages from the different steps required to safely install a package and test that it is properly installed and usable. It may also take time, especially when the requested package requires other packages (dependencies), which may themselves require other packages (etc…). Almost any code relies on other code !

Alternatively, you can click on [Tools] -> [Install Packages] in RStudio; or click on the [Install] button in the ‘Packages’ tab of the ‘File/Help’ pane.
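
In a script, a common pattern is to install a package only when it is missing, so that the script can be re-run without reinstalling everything. The sketch below uses "stats" and "utils" as placeholder names because they are shipped with R (so nothing gets installed when you try it); replace them with the packages you actually need. The repos value is the CRAN cloud mirror, given here so that R does not ask which mirror to use.

```r
# Packages the script needs; "stats" and "utils" are placeholders shipped
# with R, replace them with your own list (e.g. "dplyr")
needed_packages <- c("stats", "utils")

# requireNamespace() returns TRUE when a package is installed, FALSE otherwise
is_installed <- vapply(needed_packages, requireNamespace, logical(1),
                       quietly = TRUE)
missing_packages <- needed_packages[!is_installed]

# Install only what is missing; giving 'repos' avoids the mirror prompt
if (length(missing_packages) > 0) {
  install.packages(missing_packages, repos = "https://cloud.r-project.org")
}
```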

You can list installed packages with installed.packages(), and find packages that can be updated with old.packages().

These packages can be updated with update.packages().
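
As a short sketch of these functions in practice: installed.packages() returns a table with one row per package, so the usual table tools apply. The two last calls are left as comments because they contact the repository and therefore need an internet connection.

```r
# One row per installed package; row names are the package names
installed <- utils::installed.packages()

# Show the name and version of the first few packages
head(installed[, c("Package", "Version")])

# Check whether a given package is installed
"stats" %in% rownames(installed)

# These two need internet access, uncomment them to try:
# outdated <- utils::old.packages()
# utils::update.packages(ask = FALSE)
```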

While the install.packages() function searches for packages in the common R package list, many bioinformatics-related packages are available from other package repositories. Just like the AppStore and the PlayStore do not offer the same applications for your phone, R has multiple sources for its packages. As someone invested in biology who wants to use R, in addition to the default CRAN repository, you need to know another one : Bioconductor.


You can install a package from Bioconductor with the BiocManager::install() function :

# Install BiocManager, a package to use Bioconductor, from CRAN
install.packages("BiocManager")

# Install a package from Bioconductor
BiocManager::install("DESeq2")

Fortunately, Bioconductor knows that CRAN exists, so when one requests the installation of a Bioconductor package that depends on CRAN packages, the installer will get them without any human intervention.

Use a package

An installed package is not actually active by default (you can’t use its functions just because it is installed). When you want to use something from an installed package, you need to invoke it first.

You can load a package with the library() function :

library(package="dplyr") # or just library(dplyr)

If the requested package is not locally installed, this will raise an error.

If there is no error message, it implies that the package is loaded.
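
Here is a minimal sketch of that error, using a made-up package name ("aPackageThatDoesNotExist", assumed not to be installed on your machine). tryCatch() is used so that the error is caught and turned into a message instead of stopping the script.

```r
# "aPackageThatDoesNotExist" is a made-up name, assumed not to be installed
load_status <- tryCatch(
  {
    library("aPackageThatDoesNotExist", character.only = TRUE)
    "loaded"
  },
  error = function(e) "not installed"
)
print(load_status)

# requireNamespace() answers the same question without raising an error
requireNamespace("aPackageThatDoesNotExist", quietly = TRUE)
```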

When a package is loaded, all its content (functions, objects, datasets, …) becomes accessible and is loaded into RAM.

NOTE 1 : when a package is loaded, its box in the ‘Packages’ pane is ticked.

NOTE 2 : There is an exception to the package invocation requirement : when one calls a function with the disambiguation syntax package::function(), R loads the package in the background without attaching it. The requested function runs, but the other functions of the package still cannot be called by their bare name.
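
You can check what the package::function() syntax does to your session with the sketch below. It uses the tools package, which is shipped with R but is normally not attached with library() in a fresh session (this is an assumption about your session; the variable names are only illustrative). search() lists attached packages, loadedNamespaces() lists packages loaded in the background.

```r
# Is the tools package attached (as it would be after library(tools)) ?
attached_before <- "package:tools" %in% search()

# Call one function from tools without library()
extension <- tools::file_ext("results.csv")
print(extension)

# The package is now loaded in the background...
namespace_loaded <- "tools" %in% loadedNamespaces()
print(namespace_loaded)

# ...but the call did not change whether it is attached
attached_after <- "package:tools" %in% search()
print(identical(attached_before, attached_after))
```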

Then you can try :

help(topic="arrange", package="dplyr")

And search for help about how to run your command.

Alternatively, there is a more complete help page for the package, reached with the browseVignettes() function. It opens in your browser automatically (or in the RStudio browser, depending on your configuration), and if you click on “HTML”, you get some information about the package like functions, tutorials, etc.

NOTE : However, such a vignette is optional in an R package : its presence only depends on the will of the package coder(s).

browseVignettes(package="dplyr")

TLDR – Too Long Didn’t Read

# Install a CRAN package
install.packages("BiocManager")

# Load a package
library("BiocManager")

# Install a BioConductor package
BiocManager::install("DESeq2")

# Get help
browseVignettes(package="DESeq2")

Tips for your project

Write a good script

Good practices (for your sake, and that of any future reader of your code) :

  • write documentation (for example, a header at the start of the script which explains the purpose of the script, the parameters you set up, and the analysis steps)
  • use comments (uninterpreted line, starting with #)
  • use code indentation (spaces before code line that shows their hierarchical structure)
  • use humanly-understandable variable names
  • do not nest too many functions inside each other, this will soon be a mess
### BAD ; difficult to understand
    print(rowMeans(data.frame(c(9, 14, 17, 9, 13),
c(11, 10, 20, 7, 17),c(15, 8,      19, 10, 15)   ))       )

### GOOD : easy to understand
## Goal: this script computes the mean of the expression of our 3 samples for each gene:
#create a dataframe with the gene expression of our 3 samples:
example_data_frame <- data.frame("Expression_Sample_1" = c(9, 14, 17, 9, 13),
                                 "Expression_Sample_2" = c(11, 10, 20, 7, 17),
                                 "Expression_Sample_3" = c(15, 8, 19, 10, 15)
                      )
#add the corresponding gene names as row names:
rownames(example_data_frame) <- c("Caml", "Scamp5", "Dgki", "Mas1", "Apba1")
#compute the mean of the expression for each gene:
mean_expression_Samples123 <- rowMeans(example_data_frame)
#print the result:
print(mean_expression_Samples123)
  • save your script frequently, as well as your working environment
  • save the versions of the loaded packages at the end of your analysis (you can print loaded packages thanks to the sessionInfo() function and save the result into an on-disk file thanks to the capture.output() function).
sessionInfo() #displays name and version of loaded packages in the console
R version 4.4.0 (2024-04-24)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 22.04.4 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0 
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0

locale:
[1] C

time zone: Europe/Paris
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] digest_0.6.35        R6_2.5.1             bookdown_0.39       
 [4] fastmap_1.1.1        xfun_0.43            cachem_1.0.8        
 [7] knitr_1.46           htmltools_0.5.8.1    rmarkdown_2.26      
[10] lifecycle_1.0.4      cli_3.6.2            rmdformatsbigr_1.0.0
[13] sass_0.4.9           jquerylib_0.1.4      compiler_4.4.0      
[16] highr_0.10           tools_4.4.0          evaluate_0.23       
[19] bslib_0.7.0          yaml_2.3.8           jsonlite_1.8.8      
[22] rlang_1.1.3         
utils::capture.output(sessionInfo(), file = "sessionInfo.txt") #save them in a file
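
As a small sketch, you can check that the file was really written, and record the version of one specific package with packageVersion(). The file goes to a temporary folder here, and "stats" is only used because it is shipped with R; use your own path and package names.

```r
# Save the session information to a file (temporary folder for this example)
version_file <- file.path(tempdir(), "sessionInfo.txt")
utils::capture.output(utils::sessionInfo(), file = version_file)

# Read the file back to check its content
recorded <- readLines(version_file)
head(recorded, 3)

# Version of a single package, as text
stats_version <- as.character(utils::packageVersion("stats"))
print(stats_version)
```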

Load and save R objects

While working on your projects, you will process datasets in R. The results of these analyses will be stored in variables. All these are stored in memory, which is volatile. This means, when you close RStudio, any unsaved object will be lost.

We already described the save.image() function, which saves a copy of your complete working environment.

However, you can also save the content of a single variable. This is useful when you want to save the result of a function (or a pipeline) but not the whole 5 hours of trial-and-error work you’ve been spending on how-to-make-that-bloody-pipeline-work-correctly.

The (compressed) format to save an object is called RDS for R Data Serialization. This is done using the saveRDS() function :

saveRDS(object = example_table, file = "example_table.RDS")

Fortunately, you can load the content of an RDS file into a variable ! This is useful when you receive an RDS file from a coworker, or when you’d like to resume your work from a saved point. This is done with the readRDS() function :

example_table <- readRDS(file = "example_table.RDS")

NOTE : You can see in the example above that loading an object saved as an RDS file requires assigning it to a variable. This means that the RDS file contains the object, but unnamed. This is not the case when saving an environment (using save()), as all loaded objects are saved, thus their respective names are kept.
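
The difference can be seen in this minimal sketch, which uses a small made-up table (scores) and files written to a temporary folder. With RDS you choose the name when reading; with save() and load(), the object comes back under its original name.

```r
# A small made-up object to save
scores <- data.frame(gene = c("Caml", "Dgki"), expression = c(9, 17))

# RDS: the object is stored unnamed, we pick a name when reading it back
rds_path <- file.path(tempdir(), "scores.RDS")
saveRDS(object = scores, file = rds_path)
restored_scores <- readRDS(file = rds_path)
identical(scores, restored_scores)

# save() / load(): the name is stored along with the object
rdata_path <- file.path(tempdir(), "scores.RData")
save(scores, file = rdata_path)
rm(scores)                          # remove the object from memory
loaded_names <- load(rdata_path)    # restores it under its original name
print(loaded_names)
```

Note that load() silently overwrites any existing variable that has the same name, which is one more reason to prefer RDS for single objects.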

Human data

WARNING : If you hold human-related genomic datasets that can allow person identification (i.e., sequences), you cannot use/upload these data anywhere without control. This is strictly illegal, and such behavior may earn you up to 5 years in jail, along with a 300 000€ fine. Art. 226-16, Section 5, Code pénal.

Packages update

It is good practice to keep package versions fixed within a project. If you update a package (whether by necessity or by choice), then you should restart your work from the start. This stands as long as you’re not 100% sure the update does not affect your results.

The Swirl package

What if I would like to keep learning R by myself, whenever I can or want ?


What is swirl?

swirl is an R package that teaches you R programming and data science interactively, at your own pace, and straight into the R console.

It presents a choice of course lessons and interactively tutors a student through them, with multiple levels of complexity. A student may be asked to watch a video, to answer a multiple-choice or fill-in-the-blanks question, or to enter a command in the R console precisely as if he or she were using R in practice. Emphasis is on the last, interacting with the R console. User responses are tested for correctness and hints are given if appropriate.

Progress is automatically saved so that one may quit at any time and later resume without losing anything.

Installation and usage

#install package
install.packages("swirl")
#load package
library(swirl)
#install the "R Programming" course
install_course("R Programming")
#start the course
swirl()

Enjoy!

Other useful commands for swirl
#quit swirl
bye()
#skip a question
skip()
#return to the main menu
main()
#allow experimentation in the R console without interference from swirl
play()
#to resume interacting with swirl
nxt()
#display a help menu
info()

Conclusion

No programming language is better than any other. They all serve different purposes. Anyone claiming otherwise is (over)-specialized in the language they are advertising (and probably has a strong lack of objectivity).

In the field of bioinformatics, languages used by the community are quite limited. The main, widely adopted options are bash, R, and Python.

While learning bash cannot be avoided nowadays if you want to interact with an HPC, it is not enough to perform a complete analysis with publication-ready figures and shaped results. You should take an interest in another programming language: R and/or Python. R allows you to do a lot of different analyses, and it has a large user community with lots of online resources for help. As such, it’s one of the easiest languages for beginners.

Please note that this advice is valid today, but may change. Many other programming languages are used, some have lost their place on the podium, and others are trying to supersede bash, R, and Python. An example is Julia.

Thibault: “Anyway Python is the best programming language in the WORLD. Don’t listen to Bastien.”
Bastien: “Huh, Thibault.. Remember when one talked about objectivity ? :p”